LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 189 posts in 59.6 ms
🤖 AI · GitHub · 5 days ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
Covers: uv
Discussed on: Hacker News
🏗️ LLM Infrastructure · arxiv.org · 2 days ago
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
🔓 Open Source AI · Anyscale blog posts · 2 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by: Google Cloud Blog
Discussed on: Hacker News
🧠 Memory Management · thecomputersciencebook.com · 5 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🧠 Inference Serving · Towards AI · 2 days ago
Continuous Batching: How to Keep Your GPU Actually Busy
🏗️ LLM Infrastructure · vettedconsumer.com · 5 days ago
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
Covers: 2 stories, including Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🆕 New AI · huggingface.co · 2 days ago
225B-A23B
Covered by: news.smol.ai
Discussed on: r/LocalLLaMA
🏗️ LLM Infrastructure · Martin Alderson · 6 days ago
A brief history of KV cache compression developments
Covers: TurboQuant: Redefining AI efficiency with extreme compression
Less-relevant results
🔓 Open Source AI · mstar.stanford.edu · 2 days ago
M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models
Discussed on: Hacker News
🤖 AI · GitHub · 11 hours ago
Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)
Discussed on: Hacker News
🏗️ LLM Infrastructure · abhishek.it · 2 days ago
Running GLM-5.2 5x faster at 500tps with limitation
Discussed on: Hacker News
🤖 AI · devashish.me · 4 days ago
Two Qwen3 models on one DGX Spark: the residency math
Discussed on: Hacker News
🤖 AI · unsloth.ai · 1 day ago
GLM-5.2 – How to Run Locally
Covers: 2 stories, including "GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen..."
Covered by: news.smol.ai
Discussed on: Hacker News
🏗️ LLM Infrastructure · Google Cloud Blog · 3 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience
🤖 AI · mlx-optiq.com · Content type: Video · 6 days ago
Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon
Covered by: GitHub, Nitter
Discussed on: Hacker News
🏗️ LLM Infrastructure · arxiv.org · 2 days ago
SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL
🤖 AI · GitHub · 12 hours ago
Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon
Discussed on: Hacker News
🤖 AI · threadreaderapp.com · 1 day ago
A YouTuber just did what $60 billion in funding could not stop.
Covers: 2 stories, including Ollama
🧩 MoE · huggingface.co · 6 days ago
coder543/command-a-plus-05-2026-gguf
Covers: 3 stories, including AlterLang InterCode: A Native Intercomprehension Paradigm in Programming, Powered by GuruDev
Discussed on: r/LocalLLaMA
🤖 AI · rocm.blogs.amd.com · 4 days ago
Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Discussed on: Hacker News